A Learning Based Approach to Control Synthesis of Markov Decision
Processes for Linear Temporal Logic Specifications

Dorsa  Sadigh,  Eric  S.  Kim,  Samuel  Coogan,  S.  Shankar  Sastry,  Sanjit  A.  Seshia 


Abstract — We  propose  to  synthesize  a  control  policy  for  a 
Markov  decision  process  (MDP)  such  that  the  resulting  traces 
of  the  MDP  satisfy  a  linear  temporal  logic  (LTL)  property. 
We  construct  a  product  MDP  that  incorporates  a  deterministic 
Rabin  automaton  generated  from  the  desired  LTL  property. 
The  reward  function  of  the  product  MDP  is  defined  from  the 
acceptance  condition  of  the  Rabin  automaton.  This  construction 
allows  us  to  apply  techniques  from  learning  theory  to  the 
problem  of  synthesis  for  LTL  specifications  even  when  the 
transition  probabilities  are  not  known  a  priori.  We  prove  that 
our  method  is  guaranteed  to  find  a  controller  that  satisfies  the 
LTL  property  with  probability  one  if  such  a  policy  exists,  and 
we  suggest  empirically  with  a  case  study  in  traffic  control  that 
our  method  produces  reasonable  control  strategies  even  when 
the  LTL  property  cannot  be  satisfied  with  probability  one. 

I.  Introduction 

Control of Markov Decision Processes (MDPs) is a
problem that is well studied for applications such as
robotic surgery, unmanned aircraft control and control of
autonomous vehicles [1], [2], [3]. In recent years, there has
been an increased interest in exploiting the expressiveness
of temporal logic specifications in controlling MDPs [4],
[5], [6]. Linear Temporal Logic (LTL) provides a natural
framework for expressing rich properties such as stability,
surveillance, response, safety and liveness. Traditionally,
control synthesis for LTL specifications is solved by finding
a winning policy for a game between system requirements
and environment assumptions [7], [8].

More recently, there has been an effort in exploiting these
techniques in designing controllers to satisfy high level
specifications for probabilistic systems. Ding et al. [6] address
this problem by proposing an approach for finding a policy
that maximizes satisfaction of LTL specifications of the form
$\phi = \mathbf{GF}\pi \wedge \psi$ subject to minimization of the expected cost
in between visiting states satisfying $\pi$. In order to maximize
the satisfaction probability of $\phi$, the authors appeal to results
from probabilistic model checking [9], [10]. The methods
used for maximizing this probability take advantage of
computing maximal end components, which are not well suited
for partial MDPs with unknown probabilities. We present a
different technique that does not require preprocessing of the
model. Our algorithm learns the transition probabilities of a
partial model online. Our method can therefore be applied in
practical contexts where we start from a partial model with
unspecified probabilities.

Our  approach  is  based  on  finding  a  policy  that  maximizes 
the  expected  utility  of  an  auxiliary  MDP  constructed  from 
the  original  MDP  and  a  desired  LTL  specification.  As  in 
the  above  mentioned  existing  work,  we  convert  the  LTL 
specification  to  a  deterministic  Rabin  automaton  (DRA)  [11], 
[12],  and  construct  a  product  MDP  such  that  the  states  of  the 
product  MDP  are  pairs  representing  states  of  the  original 
MDP  in  addition  to  states  of  the  DRA  that  encodes  the 
desired  LTL  specification.  The  novelty  of  our  approach  is 
that  we  then  define  a  state  based  reward  function  on  this 
product  MDP  based  on  the  Rabin  acceptance  condition  of 
the DRA. We extend our results to allow unknown transition
probabilities and learn them online. Furthermore, we select
the  reward  function  on  the  product  MDP  so  it  corresponds 
to  the  Rabin  acceptance  condition  of  the  LTL  specification. 
Therefore,  any  learning  algorithm  that  optimizes  the  expected 
utility  can  be  applied  to  find  a  policy  that  satisfies  the 
specification. 

We  implement  our  method  using  a  reinforcement  learning 
algorithm  that  finds  the  policy  optimizing  the  expected 
utility  of  every  state  in  the  Rabin-weighted  product  MDP. 
Moreover, we prove that if there exists a strategy that satisfies
the LTL specification with probability one, our method is
guaranteed to find such a strategy. For situations where a
policy satisfying the LTL specification with probability one
does not exist, our method finds reasonable strategies. We
show  this  performance  for  two  case  studies:  1)  Control  of 
an  agent  in  a  grid  world,  and  2)  Control  of  a  traffic  network 
with  intersections. 

This  paper  is  organized  as  follows:  In  Section  II,  we 
review  necessary  preliminaries.  In  Section  III-A,  we  define 
the  synthesis  problem  and  provide  theoretical  guarantees  in 
finding  a  policy  satisfying  the  specification  for  a  special 
case.  Section  III-B  discusses  a  learning  approach  towards 
finding  an  optimal  controller.  We  provide  two  case  studies 
in  Section  IV.  Finally,  we  conclude  in  Section  V. 

II.  Preliminaries 

We introduce preliminaries on the specification language
and the probabilistic model of the system. We use Linear
Temporal Logic (LTL) to define desired specifications. An
LTL formula is built from atomic propositions $\pi \in \Pi$ that
are defined over states of the system and evaluate to True or
False, propositional formulas $\phi$ that are composed of
atomic propositions and Boolean operators such as $\wedge$ (and)
and $\neg$ (negation), and temporal operations on $\phi$. Some of the
common temporal operators are defined as:

    $\mathbf{G}\phi$: $\phi$ is true at all future moments.
    $\mathbf{F}\phi$: $\phi$ is true at some future moment.
    $\mathbf{X}\phi$: $\phi$ is true at the next moment.
    $\phi_1 \mathbf{U} \phi_2$: $\phi_1$ is true until $\phi_2$ becomes true.

Using LTL, we can define interesting liveness and safety
properties such as surveillance properties $\mathbf{GF}\phi$, or stability
properties $\mathbf{FG}\phi$.

Definition 1. A deterministic Rabin automaton is a tuple
$\mathcal{R} = (Q, \Sigma, \delta, q_0, F)$ where $Q$ is the set of states; $\Sigma$ is the
input alphabet; $\delta : Q \times \Sigma \to Q$ is the transition function; $q_0$
is the initial state and $F$ represents the acceptance condition:
$F = \{(G_1, B_1), \ldots, (G_{n_F}, B_{n_F})\}$ where $G_i, B_i \subseteq Q$ for
$i = 1, \ldots, n_F$.

A run of a Rabin automaton is an infinite sequence $r =
q_0 q_1 \ldots$ where $q_0$ is the initial state and, for all $i \ge 0$,
$q_{i+1} = \delta(q_i, \sigma)$ for some input $\sigma \in \Sigma$. For every run $r$ of the
Rabin automaton, $\inf(r) \subseteq Q$ is the set of states that are visited
infinitely often in the sequence $r = q_0 q_1 \ldots$. A run $r = q_0 q_1 \ldots$
is accepting if there exists $i \in \{1, \ldots, n_F\}$ such that:

    $\inf(r) \cap G_i \neq \emptyset$ and $\inf(r) \cap B_i = \emptyset$    (1)
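
As an illustration of condition (1), the following Python sketch checks acceptance for a run whose set of infinitely visited states is already known. The function name `is_accepting`, the state names, and the set-based encoding of the acceptance pairs are assumptions made for this example and do not come from the formulation above.

```python
def is_accepting(inf_states, acceptance_pairs):
    """Return True if some pair (G_i, B_i) satisfies condition (1).

    inf_states: set of automaton states visited infinitely often, inf(r).
    acceptance_pairs: iterable of (G_i, B_i) pairs, each a set of states.
    """
    for good, bad in acceptance_pairs:
        visits_good = len(inf_states & good) > 0   # inf(r) meets G_i
        avoids_bad = len(inf_states & bad) == 0    # inf(r) misses B_i
        if visits_good and avoids_bad:
            return True
    return False


# Hypothetical acceptance condition with two pairs over states q0..q3.
pairs = [({"q1"}, {"q2"}), ({"q3"}, set())]
print(is_accepting({"q0", "q1"}, pairs))   # first pair holds
print(is_accepting({"q1", "q2"}, pairs))   # no pair holds
```

Only one pair needs to hold, which is why the loop returns as soon as a satisfying pair is found.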

For any LTL formula $\phi$ over $\Pi$, a deterministic Rabin
automaton (DRA) can be constructed with input alphabet
$\Sigma = 2^{\Pi}$ that accepts all and only words over $\Pi$ that satisfy
$\phi$ [12]. We let $\mathcal{R}_\phi$ denote this DRA.

Definition 2. A labeled Markov Decision Process (MDP) is a
tuple $\mathcal{M} = (S, A, P, s_0, \Pi, L)$ where $S$ is a finite set of states
of the MDP; $A$ is a finite set of possible actions (controls)
and $A : S \to 2^A$ is defined as the mapping from states
to actions; $P$ is a transition probability function defined as
$P : S \times A \times S \to [0, 1]$; $s_0 \in S$ is the initial state; $\Pi$ is
a set of atomic propositions, and $L : S \to 2^{\Pi}$ is a labeling
function that labels a set of states with atomic propositions.

III.  Synthesis  through  Reward  Maximization 


A. Problem Formulation

Consider a labeled MDP

    $\mathcal{M} = (S, A, P, s_0, \Pi, L)$    (2)

and a linear temporal logic specification $\phi$.

Definition 3. A policy for $\mathcal{M}$ is a function $\pi : S^+ \to A$
such that $\pi(s_0 s_1 \ldots s_n) \in A(s_n)$ for all $s_0 s_1 \ldots s_n \in S^+$,
where $S^+$ denotes the set of all finite sequences of states in
$S$.

Observe that a policy $\pi$ for an MDP $\mathcal{M}$ induces a Markov
chain which we denote by $\mathcal{M}_\pi$. A run of a Markov chain
is an infinite sequence of states $s_0 s_1 \ldots$, where $s_0$ is the
initial state of the Markov chain and, for all $i$, $P(s_i, a, s_{i+1})$
is nonzero for some action $a \in A$.

Our objective is to compute a policy $\pi^*$ for $\mathcal{M}$ such that
the runs of $\mathcal{M}_{\pi^*}$ satisfy the LTL formula $\phi$ with probability
one as defined below. Our approach composes $\mathcal{M}$ and the
DRA $\mathcal{R}_\phi = (Q, \Sigma, \delta, q_0, F)$ whose acceptance condition
corresponds to satisfaction of $\phi$. We then obtain a policy $\pi^*$
for this composition. Our approach is particularly amenable
to learning-based algorithms as we discuss in Section III-B.
In particular, the policy $\pi^*$ can be constructed even when the
transition probabilities $P$ for $\mathcal{M}$ are not known. Thus, we
present an approach that allows the policy $\pi^*$ to be found
online while learning the transition probabilities of $\mathcal{M}$.

We create a Rabin weighted product MDP $\mathcal{P}$, defined
below, using the DRA $\mathcal{R}_\phi$ and labeled MDP $\mathcal{M}$. The set
of states $S_{\mathcal{P}}$ of $\mathcal{P}$ is a set of augmented states with
components that correspond to states in $\mathcal{M}$ and components
that correspond to states in $\mathcal{R}_\phi$. The set of actions $A_{\mathcal{P}}$ is
identical to the set of actions in $\mathcal{M}$.

To this end, we define a Rabin weighted product MDP
given an MDP $\mathcal{M}$ and a DRA $\mathcal{R}$ as follows:

Definition 4. A Rabin weighted product MDP, or simply
product MDP, between a labeled MDP $\mathcal{M} =
(S, A, P, s_0, \Pi, L)$ and a DRA $\mathcal{R} = (Q, \Sigma, \delta, q_0, F)$ is
defined as a tuple $\mathcal{P} = (S_{\mathcal{P}}, A_{\mathcal{P}}, P_{\mathcal{P}}, s_{\mathcal{P}0}, F_{\mathcal{P}}, W_{\mathcal{P}})$ [6],
where:

•  $S_{\mathcal{P}} = S \times Q$ is the set of states.

•  $A_{\mathcal{P}}$ provides the set of control actions from the MDP:
$A_{\mathcal{P}}((s, q)) = A(s)$.

•  $P_{\mathcal{P}}$ is the set of transition probabilities defined as:

    $P_{\mathcal{P}}(s_{\mathcal{P}}, a, s'_{\mathcal{P}}) = \begin{cases} P(s, a, s') & \text{if } q' = \delta(q, L(s)) \\ 0 & \text{otherwise} \end{cases}$

where $s_{\mathcal{P}} = (s, q) \in S_{\mathcal{P}}$ and $s'_{\mathcal{P}} = (s', q')$.

•  $s_{\mathcal{P}0} = (s_0, q_0) \in S_{\mathcal{P}}$ is the initial state.

•  $F_{\mathcal{P}}$ is the acceptance condition given by

    $F_{\mathcal{P}} = \{(\mathcal{G}_1, \mathcal{B}_1), \ldots, (\mathcal{G}_{n_F}, \mathcal{B}_{n_F})\}$

where $\mathcal{G}_i = S \times G_i$ and $\mathcal{B}_i = S \times B_i$.

•  For the above acceptance condition, $W_{\mathcal{P}} = \{W^i_{\mathcal{P}}\}_{i=1}^{n_F}$ is
a collection of reward functions $W^i_{\mathcal{P}} : S_{\mathcal{P}} \to \mathbb{R}$ defined
by:

    $W^i_{\mathcal{P}}(s_{\mathcal{P}}) = \begin{cases} w_G & \text{if } s_{\mathcal{P}} \in \mathcal{G}_i \\ w_B & \text{if } s_{\mathcal{P}} \in \mathcal{B}_i \\ 0 & \text{if } s_{\mathcal{P}} \in S_{\mathcal{P}} \setminus (\mathcal{G}_i \cup \mathcal{B}_i) \end{cases}$    (4)

where $w_G > 0$ is a positive reward and $w_B < 0$ is a
negative reward.

We let $\mathcal{N}_i = S_{\mathcal{P}} \setminus (\mathcal{G}_i \cup \mathcal{B}_i)$ for every pair $(\mathcal{G}_i, \mathcal{B}_i)$.
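
The product construction of Definition 4 can be sketched in a few lines of Python. This is a minimal illustration under assumed data structures (dictionaries keyed by tuples); the function name `build_product`, the argument layout, and the toy MDP and automaton are hypothetical and are not taken from the paper.

```python
from itertools import product


def build_product(S, Q, P, L, delta, s0, q0, pairs, w_G=1.0, w_B=-1.0):
    """Build the product transitions and one reward map per Rabin pair.

    P[(s, a, s2)] is a transition probability of the labeled MDP,
    L[s] is the label of s, delta[(q, label)] is the DRA successor,
    and pairs is a list of (G_i, B_i) sets of DRA states.
    """
    P_prod = {}
    for (s, a, s2), prob in P.items():
        for q in Q:
            q2 = delta[(q, L[s])]               # q' = delta(q, L(s))
            P_prod[((s, q), a, (s2, q2))] = prob
    rewards = []
    for G, B in pairs:
        W = {}
        for s, q in product(S, Q):
            if q in G:
                W[(s, q)] = w_G                 # state in S x G_i
            elif q in B:
                W[(s, q)] = w_B                 # state in S x B_i
            else:
                W[(s, q)] = 0.0                 # neutral state
        rewards.append(W)
    return P_prod, (s0, q0), rewards


# Toy example (assumed): the DRA moves to q1 after leaving a state labeled "a".
empty, a = frozenset(), frozenset({"a"})
S, Q = ["s0", "s1"], ["q0", "q1"]
P = {("s0", "go", "s0"): 0.5, ("s0", "go", "s1"): 0.5, ("s1", "go", "s1"): 1.0}
L = {"s0": empty, "s1": a}
delta = {("q0", empty): "q0", ("q0", a): "q1",
         ("q1", empty): "q1", ("q1", a): "q1"}
P_prod, init, rewards = build_product(S, Q, P, L, delta, "s0", "q0",
                                      [({"q1"}, set())])
```

Transitions that are absent from `P_prod` have probability zero, which corresponds to the "otherwise" case of the transition definition.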

We use the notation $\mathcal{P}^i$ to denote $\mathcal{P}$ with the specific
reward function $W^i_{\mathcal{P}}$. In seeking a policy $\pi$ for $\mathcal{M}$ such that
$\mathcal{M}_\pi$ satisfies $\phi$, it suffices to consider stationary policies of
the corresponding Rabin weighted product MDP [9].

Definition 5. A stationary policy $\pi$ for a product MDP $\mathcal{P}$ is
a mapping $\pi : S_{\mathcal{P}} \to A_{\mathcal{P}}$ that maps every state to the action
selected by policy $\pi$.

A stationary policy for $\mathcal{P}$ corresponds to a finite memory
policy for $\mathcal{M}$. We let $\mathcal{P}_\pi$ denote the Markov chain induced
by applying the stationary policy $\pi$ to the product MDP $\mathcal{P}$.

Let $r = s_{\mathcal{P}0} s_{\mathcal{P}1} s_{\mathcal{P}2} \ldots$ be a run of $\mathcal{P}_\pi$ with initial product
state $s_{\mathcal{P}0}$.

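Definition 6. Consider an MDP $\mathcal{M}$ and an LTL formula $\phi$ with
corresponding DRA $\mathcal{R}_\phi$, let $\mathcal{P}$ be the corresponding Rabin
weighted MDP, and let $\pi$ be a stationary policy on $\mathcal{P}$. We
say that $\mathcal{M}_\pi$ satisfies $\phi$ with probability 1 if

    $\Pr(\{r : \exists (\mathcal{G}_i, \mathcal{B}_i) \in F_{\mathcal{P}} \text{ such that } \inf(r) \cap \mathcal{G}_i \neq \emptyset \wedge \inf(r) \cap \mathcal{B}_i = \emptyset\}) = 1$

where $r$ is a run of $\mathcal{P}_\pi$ initialized at $s_{\mathcal{P}0}$.

Intuitively, $\mathcal{M}_\pi$ satisfies $\phi$ with probability one if the
probability measure of the runs of $\mathcal{P}_\pi$ that violate the
acceptance condition of $\phi$ is 0.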

We let $i$ be the index of a Rabin acceptance pair for property
$\phi$. A reward function $W^i_{\mathcal{P}}(s_{\mathcal{P}})$ on every state is specified
in Definition 4 and can be identified by a vector $W^i \in \mathbb{R}^{|S_{\mathcal{P}}|}$ for
some enumeration of $S_{\mathcal{P}}$. We assign a negative reward $w_B$
to states $s_{\mathcal{P}} \in \mathcal{B}_i = S \times B_i$ since we would like to visit them
only finitely often. Similarly we assign positive rewards $w_G$
to $s_{\mathcal{P}} \in \mathcal{G}_i$, and a reward of 0 on neutral states $s_{\mathcal{P}} \in \mathcal{N}_i$, to
bias the policy towards satisfaction of the Rabin automaton's
acceptance condition.

Definition 7. For $i \in \{1, \ldots, n_F\}$, the expected discounted
utility for a policy $\pi$ on $\mathcal{P}^i$ with discount factor $0 < \gamma < 1$
is a vector $U^i_\pi = [U^i_\pi(s_0) \ldots U^i_\pi(s_N)]$ for $s_k \in S_{\mathcal{P}}$, $k \in
\{0, \ldots, N\}$ and $N = |S_{\mathcal{P}}| - 1$, such that:

    $U^i_\pi = \sum_{n=0}^{\infty} \gamma^n P_\pi^n W^i$    (5)

where $W^i$ is the vector of the rewards $W^i_{\mathcal{P}}(s_{\mathcal{P}})$ and $P_\pi$ is
a matrix containing the probabilities $P_{\mathcal{P}}(s_{\mathcal{P}}, \pi(s_{\mathcal{P}}), s'_{\mathcal{P}})$. For
simpler notation, we omit the superscript $i$, the index of the Rabin
acceptance condition of the LTL specification. In the rest of
this paper, it is assumed that $W$ and $U_\pi$ are the reward and
utility vectors of the product MDP with their corresponding
Rabin acceptance condition pair $(\mathcal{G}_i, \mathcal{B}_i)$.
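
Because $P_\pi$ is a row-stochastic matrix and $0 < \gamma < 1$, the series in (5) converges and admits a closed form. The identity below is a standard property of discounted Markov reward processes, added here only as a worked illustration; it is not stated in the original text.

```latex
U_\pi \;=\; \sum_{n=0}^{\infty} \gamma^n P_\pi^n W
      \;=\; (I - \gamma P_\pi)^{-1} W ,
\qquad \text{equivalently} \qquad
U_\pi \;=\; W + \gamma P_\pi U_\pi .
```

The second form is the fixed-point equation that iterative methods exploit when the matrix inverse is impractical or when $P_\pi$ is only estimated.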

Definition 8. A policy that maximizes this expected discounted
utility for every state is an optimal policy $\pi^* =
[\pi^*(s_0) \ldots \pi^*(s_N)]$, defined as:

    $\pi^* = \arg\max_{\pi} \sum_{n=0}^{\infty} \gamma^n P_\pi^n W$    (6)

Note that for any policy $\pi$ and for all $s \in S_{\mathcal{P}}$, $U_\pi(s) \le
U_{\pi^*}(s)$. From a product MDP $\mathcal{P}$, we seek a policy that
satisfies the LTL specification by optimizing the expected
future utility. Note that an optimal policy exists for each
acceptance condition $(\mathcal{G}_i, \mathcal{B}_i) \in F_{\mathcal{P}}$ and thus our reward
maximization algorithm must be run on each acceptance
condition. The outcome is a collection of strategies $\{\pi^*_i\}_{i=1}^{n_F}$
where $\pi^*_i$ is the optimal policy under rewards $W^i_{\mathcal{P}}$. We use
Definition 6 to determine whether a policy $\pi^*_i$ satisfies $\phi$
with probability one by analyzing properties of the recurrent
classes in $\mathcal{P}_{\pi^*_i}$ [9].
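
When the transition probabilities are known, one standard way to optimize the expected discounted utility for a single acceptance pair is value iteration. The sketch below is an illustration under that assumption and is not the learning algorithm of this paper; the function name, the dictionary-based model, and the three-state example are all hypothetical.

```python
def value_iteration(states, actions, P, W, gamma=0.9, tol=1e-9, max_iter=10000):
    """Approximate optimal utilities and a greedy stationary policy.

    actions[s] lists the actions enabled in s, P[(s, a)] maps successor
    states to probabilities, and W[s] is the state-based reward.
    """
    U = {s: 0.0 for s in states}

    def q_value(s, a):
        # Expected utility of the successor state under action a.
        return sum(p * U[s2] for s2, p in P[(s, a)].items())

    for _ in range(max_iter):
        change = 0.0
        for s in states:
            new_u = W[s] + gamma * max(q_value(s, a) for a in actions[s])
            change = max(change, abs(new_u - U[s]))
            U[s] = new_u
        if change < tol:
            break
    policy = {s: max(actions[s], key=lambda a: q_value(s, a)) for s in states}
    return U, policy


# Assumed example: from x, "right" leads to a rewarded absorbing state g
# and "left" leads to a penalized absorbing state b.
states = ["x", "g", "b"]
actions = {"x": ["left", "right"], "g": ["stay"], "b": ["stay"]}
P = {("x", "left"): {"b": 1.0}, ("x", "right"): {"g": 1.0},
     ("g", "stay"): {"g": 1.0}, ("b", "stay"): {"b": 1.0}}
W = {"x": 0.0, "g": 1.0, "b": -1.0}
U, policy = value_iteration(states, actions, P, W, gamma=0.9)
```

In the setting of this section, such a routine would be run once per acceptance pair, each time with the reward map of that pair.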


The following theorem shows that optimizing the expected
discounted utility produces a policy $\pi$ such that $\mathcal{M}_\pi$ satisfies
$\phi$ with probability one if such a policy exists.

Theorem 1. Given an MDP $\mathcal{M}$ and an LTL formula $\phi$ with
corresponding Rabin weighted product MDP $\mathcal{P}$. If there
exists a policy $\pi$ such that $\mathcal{M}_\pi$ satisfies $\phi$ with probability 1,
then there exists $i^* \in \{1, \ldots, n_F\}$, $\gamma^* \in [0, 1)$, and $w^*_B < 0$
such that any algorithm that optimizes the expected future
utility of $\mathcal{P}^{i^*}$ with $\gamma > \gamma^*$ and $w_B < w^*_B$ will find such a
policy.

Proof. The proof of Theorem 1 can be found in Appendix A.
Intuitively, choosing the discount factor $\gamma$ close to 1
enforces visiting $\mathcal{G}_i$ infinitely often, and a large enough
negative reward $w_B$ enforces visiting $\mathcal{B}_i$ only finitely often.
This will result in satisfaction of $\phi$ by our algorithm.  □

Theorem 1 provides a practical approach to synthesizing
a control policy $\pi^*$ for the MDP $\mathcal{M}$. After constructing
the corresponding product MDP $\mathcal{P}$, a collection of policies
$\{\pi^*_i\}_{i=1}^{n_F}$ is computed that optimize the expected future
utility of each $\mathcal{P}^i$. Provided that $\gamma$ and $|w_B|$ are sufficiently
large, if there exists a policy $\pi$ such that $\mathcal{M}_\pi$ satisfies $\phi$
with probability 1, then for at least one of the computed
policies $\pi^*_i$, $\mathcal{M}_{\pi^*_i}$ satisfies $\phi$ with probability 1. Determining
which of the policies satisfy $\phi$ with probability 1 is easily
achieved by computing strongly connected components of
the resulting Markov chains, for which there exist efficient
graph theoretic algorithms [9].
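
The check described above can be sketched as follows for one acceptance pair. For brevity this illustration uses plain reachability searches (quadratic in the number of states) rather than an efficient strongly connected component algorithm; the function names, the successor-set encoding of the Markov chain, and the three-state example are assumptions made for this sketch.

```python
def reachable(succ, start):
    """Return the set of states reachable from start (including start)."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for t in succ[s]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def satisfies_almost_surely(succ, init, good, bad):
    """Check one Rabin pair on a finite Markov chain.

    succ[s] is the set of successors of s with nonzero probability.
    A state is recurrent if every state it can reach can reach it back;
    its recurrent class is then exactly its reachable set.
    """
    reach = {s: reachable(succ, s) for s in reachable(succ, init)}
    for s, r in reach.items():
        if all(s in reach[t] for t in r):        # s is recurrent
            if not (r & good) or (r & bad):      # class misses G_i or hits B_i
                return False
    return True


# Assumed example: state 0 is transient, {1, 2} is the only recurrent class.
succ = {0: {0, 1}, 1: {2}, 2: {1}}
print(satisfies_almost_surely(succ, 0, good={2}, bad={0}))
print(satisfies_almost_surely(succ, 0, good={2}, bad={1}))
```

The test passes only if every recurrent class reachable from the initial state contains a state of the good set and no state of the bad set, so bad states may appear only among transient states.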

In  this  section,  we  have  not  provided  an  explicit  method 
for  optimizing  the  expected  utility  of  the  product  MDP  V.  If 
the transition probabilities of $\mathcal{M}$ are not known a priori, then
the  optimization  algorithm  must  simultaneously  learn  the 
transition  probabilities  while  optimizing  the  expected  utility, 
and  tools  from  learning  theory  are  well-suited  for  this  task. 
In  the  following  section,  we  discuss  how  these  tools  apply 
to  the  policy  synthesis  problem  above. 

B.  Synthesis  through  Reinforcement  Learning 

By translating the LTL synthesis problem into an expected
reward maximization framework in Section III-A, it is now
possible to use standard techniques in the reinforcement
learning literature to find satisfying control policies.

In the previous section, we did not provide an explicit
method for optimizing the expected utility of the product
MDP $\mathcal{P}$. If the transition probabilities of $\mathcal{M}$ are not known
a priori, then the optimization algorithm must 1) learn the
transition probabilities and 2) optimize the expected utility.
Tools from learning theory are well-suited for this task.

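Algorithm 1 below is a modified active temporal difference
learning algorithm [13] that accomplishes these goals. It
is called after each observed transition and updates a set
of persistent variables, which include a table of transition
frequencies, state utilities, and the optimal policy, each of which
can be initialized by the user with a priori estimates. The
magnitude of the update is determined by a learning rate, $\alpha$.
Algorithm 1 is customized to take advantage of the structure
in $\mathcal{P}$ to converge more quickly to the actual transition
probabilities. Observe that product states corresponding to the
same labeled MDP state have the same transition probability
structure, i.e., $P_{\mathcal{P}}(s_{\mathcal{P}}, a, s'_{\mathcal{P}}) = P_{\mathcal{P}}(\tilde{s}_{\mathcal{P}}, a, \tilde{s}'_{\mathcal{P}})$ if $s_{\mathcal{P}} = (s, q)$,
$\tilde{s}_{\mathcal{P}} = (s, \tilde{q})$, $s'_{\mathcal{P}} = (s', q')$, and $\tilde{s}'_{\mathcal{P}} = (s', \tilde{q}')$, where
$q, q', \tilde{q}, \tilde{q}' \in Q$ and $s, s' \in S$. Therefore, every iteration
in the product MDP can in fact be used to update the
transition probability estimates for all product MDP states
that share the same labeled MDP state. Thus, the algorithm
uses equivalence classes $([s_{\mathcal{P}}], a_{\mathcal{P}})$, where $[s_{\mathcal{P}}] = \{s\} \times Q =
\{(s, q) \mid q \in Q\}$, to more quickly converge to the optimal
policy.

Algorithm 1  Temporal Difference Learning for $\mathcal{P}$

Input: $s'_{\mathcal{P}}$  Current state of $\mathcal{P}$.

Output: $a'_{\mathcal{P}}$  Current action.

Persistent Values:

•  Utilities $U_{\mathcal{P}}(s_{\mathcal{P}})$ for all states of $\mathcal{P}$, initialized at 0.

•  $N_{sa}([s_{\mathcal{P}}], a_{\mathcal{P}})$, a table of frequency of state, action pairs,
initialized by the user.

•  $N_{s'|sa}([s_{\mathcal{P}}], a_{\mathcal{P}}, [s'_{\mathcal{P}}])$, a table of frequency of the
outcome of the equivalence class $[s'_{\mathcal{P}}]$ for state, action pairs
in the equivalence class $([s_{\mathcal{P}}], a_{\mathcal{P}})$, initialized by the user.

•  Optimal policy $\pi^*$ for every state, initialized at 0.

•  $s_{\mathcal{P}}, a_{\mathcal{P}}$, the previous state and action, initialized as null.

if $s'_{\mathcal{P}}$ is new then
    $U_{\mathcal{P}}(s'_{\mathcal{P}}) \leftarrow W_{\mathcal{P}}(s'_{\mathcal{P}})$
end if
if ResetConditionMet() is True then
    $s'_{\mathcal{P}}$ = ResetRabinState($s'_{\mathcal{P}}$)
else if $s_{\mathcal{P}}$ is not NULL then
    $N_{sa}([s_{\mathcal{P}}], a_{\mathcal{P}}) \leftarrow N_{sa}([s_{\mathcal{P}}], a_{\mathcal{P}}) + 1$
    $N_{s'|sa}([s_{\mathcal{P}}], a_{\mathcal{P}}, [s'_{\mathcal{P}}]) \leftarrow N_{s'|sa}([s_{\mathcal{P}}], a_{\mathcal{P}}, [s'_{\mathcal{P}}]) + 1$
    for all $t$ such that $N_{s'|sa}([s], a, [t]) \neq 0$ do
        $P([s], a, [t]) \leftarrow N_{s'|sa}([s], a, [t]) / N_{sa}([s], a)$
    end for
    $U_{\mathcal{P}}(s_{\mathcal{P}}) \leftarrow \alpha U_{\mathcal{P}}(s_{\mathcal{P}}) + (1 - \alpha)[W_{\mathcal{P}}(s_{\mathcal{P}}) + \gamma \max_{a} \sum_{\sigma} P(s_{\mathcal{P}}, a, \sigma) U(\sigma)]$
    $\pi^*(s_{\mathcal{P}}) \leftarrow \arg\max_{a \in A_{\mathcal{P}}(s_{\mathcal{P}})} \sum_{\sigma} P(s_{\mathcal{P}}, a, \sigma) U(\sigma)$
end if
Choose current action $a'_{\mathcal{P}} = f_{\mathrm{exp}}(\cdot)$
$s_{\mathcal{P}} = s'_{\mathcal{P}}$
$a_{\mathcal{P}} = a'_{\mathcal{P}}$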
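
To make the role of the equivalence classes concrete, the following Python sketch mirrors the update step of Algorithm 1 in simplified form. It is not the authors' implementation: the class name `RabinTDLearner`, the callbacks `actions`, `reward` and `next_q`, and the choice to pass the selected action in from outside are assumptions of this example, and the Rabin reset and the exploration function are omitted.

```python
from collections import defaultdict


class RabinTDLearner:
    def __init__(self, actions, reward, next_q, alpha=0.5, gamma=0.9):
        self.actions = actions      # actions(s) -> list of enabled actions
        self.reward = reward        # reward((s, q)) -> reward of product state
        self.next_q = next_q        # next_q(s, q) -> delta(q, L(s))
        self.alpha, self.gamma = alpha, gamma
        self.U = {}                         # utilities of product states
        self.N_sa = defaultdict(int)        # counts per (MDP state, action)
        self.N_s1sa = defaultdict(int)      # counts per (MDP state, action, successor)
        self.policy = {}
        self.prev = None                    # previous (product state, action)

    def utility(self, sp):
        if sp not in self.U:                # new state: initialize with its reward
            self.U[sp] = self.reward(sp)
        return self.U[sp]

    def expected_utility(self, sp, a):
        """Estimated expected successor utility; counts depend only on s."""
        s, q = sp
        n = self.N_sa[(s, a)]
        if n == 0:
            return 0.0
        q2 = self.next_q(s, q)
        return sum(cnt / n * self.utility((s2, q2))
                   for (s1, a1, s2), cnt in list(self.N_s1sa.items())
                   if s1 == s and a1 == a)

    def observe(self, sp_new, action_new):
        """Process a newly observed product state and the action chosen in it."""
        self.utility(sp_new)
        if self.prev is not None:
            sp, a = self.prev
            self.N_sa[(sp[0], a)] += 1
            self.N_s1sa[(sp[0], a, sp_new[0])] += 1
            best = max(self.actions(sp[0]),
                       key=lambda b: self.expected_utility(sp, b))
            target = self.reward(sp) + self.gamma * self.expected_utility(sp, best)
            self.U[sp] = self.alpha * self.U[sp] + (1 - self.alpha) * target
            self.policy[sp] = best
        self.prev = (sp_new, action_new)


# Assumed toy setup: one action, reward 1 in MDP state "g", trivial automaton.
learner = RabinTDLearner(actions=lambda s: ["go"],
                         reward=lambda sp: 1.0 if sp[0] == "g" else 0.0,
                         next_q=lambda s, q: q)
learner.observe(("x", "q"), "go")
learner.observe(("g", "q"), "go")
```

Because the frequency tables are keyed by the MDP state alone, a transition observed under one Rabin state immediately informs the estimate for every product state that shares the same MDP state.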

Traditionally, temporal difference learning occurs over
multiple trials where the initial state is reset after each
trial [14]. Similarly, in an online application, where we
cannot reset the labeled MDP state, we periodically reset the
Rabin component of the product state to $q_0$. For instance,
if the LTL formula contains any safety specifications, then a
safety violation will make it impossible to reach a state with
positive reward in $\mathcal{P}$. To ensure we obtain a correct control
action for every state, we introduce a function
“ResetConditionMet()” in Algorithm 1 that forces a Rabin state reset
whenever a safety violation is detected, or heuristically after
a set time interval if liveness properties are not being met.
In both case studies, we observed that this reset technique
results in Algorithm 1 converging to a satisfying policy.

Fig. 1. A grid world example with a superimposed sample trajectory
under the policy $\pi^*$ generated by the reinforcement learning algorithm.
The trajectory has a length of 1000 time steps and an initial location (0,3)
denoted by a solid square. The arrows denote movement from the box
containing the arrow to a corresponding adjacent state. Locations (3,0) and
(4,0) do not have any arrows because they are not reachable from the initial
state under our policy. Note that $\pi^*$ is deterministic, but may cause a single
location on the grid (e.g. location (4,2)) to have different actions under
different Rabin states.

We note that online learning algorithms on general MDPs
do not have hard convergence guarantees to the optimal
policy because of the exploitation versus exploration
dilemma [13]. A learning agent decides whether to explore or
exploit via the exploration function $f_{\mathrm{exp}}$. One possible
exploration function for probably approximately correct learning
observes transitions and builds an internal model of the
transition probabilities. The agent defaults to an exploitation
mode and only explores if it can learn more about the system
dynamics [15].

IV.  Case  Studies 
A.  Control  of  an  agent  in  a  grid  world 

For illustrative purposes, we consider an agent in a 5 x 5
grid world that is required to visit regions labeled A and B
infinitely often, while avoiding region C. The LTL
specification is given as the following formula:

    $\mathbf{GF}A \wedge \mathbf{GF}B \wedge \mathbf{G}\neg C$    (7)

The agent is allowed four actions, where each one
expresses a preference for a diagonal direction. An “upper
right” action will cause the agent to move right with
probability 0.4, up with probability 0.4, and remain stationary
with probability 0.2. If a wall is located to the agent’s right
then it will move up with probability 0.8, if one is located
above then it will move to the right with probability 0.8, and
if the agent is in the upper right corner, then it is guaranteed
to remain in the same location. The dynamics for the other
actions are identical after an appropriate rotation.
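
The "upper right" dynamics can be written out as a small transition function. The sketch below is illustrative only: the coordinate convention (x grows to the right, y grows upward) and the function name are assumptions, and where the text leaves the remaining probability mass next to a wall unspecified, the sketch assumes the agent stays in place with probability 0.2.

```python
def upper_right_distribution(x, y, size=5):
    """Return {next_cell: probability} for the "upper right" action.

    Cells are (x, y) with 0 <= x, y < size. A wall lies to the right of
    the last column and above the last row (assumed convention).
    """
    wall_right = (x == size - 1)
    wall_up = (y == size - 1)
    if wall_right and wall_up:
        return {(x, y): 1.0}                    # upper right corner: no movement
    if wall_right:
        return {(x, y + 1): 0.8, (x, y): 0.2}   # 0.2 stay is an assumption
    if wall_up:
        return {(x + 1, y): 0.8, (x, y): 0.2}   # 0.2 stay is an assumption
    return {(x + 1, y): 0.4, (x, y + 1): 0.4, (x, y): 0.2}


interior = upper_right_distribution(2, 2)
corner = upper_right_distribution(4, 4)
```

The other three actions would follow by rotating the roles of the coordinates, as described in the text.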


Fig. 2. A traffic network consisting of East-West links 1 and 2 and
North-South links 3 and 4 and two signalized intersections. The gray links are not
explicitly modeled.


Figure 1 shows the results of the learning algorithm with
an exploration function $f_{\mathrm{exp}}(\cdot)$ that simply outputs random
actions while learning. The product MDP contained 150
states and one acceptance pair, with rewards $w_G = 500$, $w_B = -500$ and
discount factor $\gamma = 0.98$. There were 600 trials, which are separated by a
Rabin reset every 200 time steps.

Observe that no policy exists such that $\phi$ is satisfied for
all runs of the MDP. For example, it is possible that every
action results in no movement of the robot. However, it is
clear that there exists a policy that satisfies $\phi$ with probability
1, thus this example satisfies the conditions for Theorem 1.

B.  Control  of  a  Traffic  Network  with  Two  Intersections 

To demonstrate the utility of our approach, we apply our
control synthesis algorithm to a traffic network with two
signalized intersections as depicted in Figure 2. We employ
a traffic flow model with a time step of 15 seconds. At each
discrete time step, signal $v_1$ either actuates link 1 or link
3, and signal $v_2$ actuates link 2 or link 4. For $i = 1, 2$,
the Boolean variable $s_{v_i}$ is equal to 1 if link $i$ is actuated
at signal $v_i$ and is equal to 0 otherwise. The set of control
actions is then

    $A = \{(1, 2), (1, 4), (3, 2), (3, 4)\}$    (8)

where, for $a \in A$, $l \in a$ implies that link $l$ is actuated. The
gray links in Fig. 2 are not explicitly considered in the model
as they carry traffic out of the network.

The  model  considers  a  queue  of  vehicles  waiting  on  each 
link,  and  at  each  time  step,  the  queue  is  forwarded  to 
downstream  links  if  the  queue’s  link  is  actuated  and  if  there 
is available road space downstream. If the queue is longer
than some saturating limit, then only this limit is forwarded
and the remainder remains enqueued for the next time step.
The  vehicles  that  are  forwarded  divide  among  downstream 
links  via  turn  ratios  given  with  the  model. 

Let $C_l > 0$ be the capacity of link $l$. Here, the queue
length is assumed to take on continuous values. To obtain
a discrete model, the interval $[0, C_l] \subset \mathbb{R}$ is divided into a
finite, disjoint set of subintervals. For example, if link $l$ can
accommodate up to $C_l = 40$ vehicles, we may divide $[0, 40]$
into the set $\{[0, 10], (10, 20], (20, 30], (30, 40]\}$. The current
discrete state of link $l$ is then the subinterval that contains
the current queue length of link $l$, and the total state of the
network is the collection of current subintervals containing
the current queue lengths of each link.
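
The discretization described above amounts to a lookup of the subinterval that contains a queue length. A minimal Python sketch, assuming the subintervals are given by their upper endpoints (the function name `queue_bin` and this encoding are choices made for the example):

```python
from bisect import bisect_left


def queue_bin(length, edges):
    """Return the index of the subinterval containing a queue length.

    edges are the upper endpoints of the subintervals, e.g. [10, 20, 30, 40]
    for {[0, 10], (10, 20], (20, 30], (30, 40]}.
    """
    if not 0 <= length <= edges[-1]:
        raise ValueError("queue length outside [0, capacity]")
    # bisect_left finds the first upper endpoint that is >= length,
    # which matches intervals that are closed on the right.
    return bisect_left(edges, length)


edges = [10, 20, 30, 40]
bins = [queue_bin(x, edges) for x in (0, 10, 10.5, 20, 39.9, 40)]
```

The discrete state of the whole network is then the tuple of such indices, one per link.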

Here, we consider probabilistic transitions among the
discrete states and obtain an MDP model with control actions
$A$ as defined in (8). For the example in Fig. 2, we have

    $(C_1, C_2, C_3, C_4) = (40, 50, 30, 30)$    (9)

and link 1 is divided into four subintervals, link 2 is divided
into five subintervals, and links 3 and 4 are divided into two
subintervals each. In addition, we augment the state space
with the last applied control action so that the control
objective, expressed as an LTL formula, may include conditions
on the traffic lights as is the case below; thus there are 320
total discrete states. The transition probabilities for the MDP
model are determined by the specific subintervals, saturating
limits, and turn ratios. Future research will investigate the
details of abstracting the traffic dynamics to an MDP.

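Let $x_i$ for $i = 1, \ldots, 4$ denote the number of vehicles
enqueued on link $i$. We consider the following control objective:

    $\mathbf{FG}(x_1 \le 30 \wedge x_2 \le 30) \wedge$    (10)

    $\mathbf{GF}(x_3 < 10) \wedge \mathbf{GF}(x_4 < 10) \wedge$    (11)

    $\mathbf{G}\big((s_{v_2} \wedge \mathbf{X}(\neg s_{v_2})) \Rightarrow (\mathbf{XX}(\neg s_{v_2}) \wedge \mathbf{XXX}(\neg s_{v_2}))\big)$.    (12)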

In words, (10)-(12) is

(Eventually links 1 and 2 have adequate supply) and
(Infinitely often, links 3 and 4 have short queues) and
(When signal $v_2$ actuates link 4,
it does so for a minimum of 3 time steps)

where “adequate supply” means the number of vehicles on
links 1 and 2 does not exceed 30 vehicles, so that these links can
always accept incoming traffic, and a queue is “short” if the
queue length is less than 10. Condition (12) is a minimum
green time for actuation of link 4 at signal 2 and may be
necessary if, e.g., there is a pedestrian crosswalk across link 2
which requires at least 45 seconds (three time steps) for safe
crossing (recall that $s_{v_2} = 1$ when link 2 is actuated). The
above condition is encoded in a Rabin automaton with one
acceptance pair and 37 states. The Rabin-weighted product
MDP contains 11,840 states and rewards corresponding to
the one acceptance pair.
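
The state counts quoted for this case study can be checked directly from the numbers of subintervals per link, the four control actions, and the 37 automaton states. The grouping below is only a worked restatement of figures already given in the text.

```latex
\underbrace{4 \cdot 5 \cdot 2 \cdot 2}_{\text{queue subintervals of links 1--4}}
\cdot \underbrace{4}_{\text{last applied action}} = 320,
\qquad
\underbrace{320}_{\text{MDP states}} \cdot \underbrace{37}_{\text{DRA states}} = 11{,}840 .
```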

In Fig. 3, we explore how our approach can be used to
synthesize a control policy. Restating (10)-(12), the control
objective requires the two solid traces to eventually remain
below the threshold at 30 vehicles and for the two dashed
traces to infinitely often move below the threshold at 10
vehicles. Additionally, signal 2 should be red for at least
three consecutive time steps whenever it switches from green
to red.

Fig. 3(a) shows a naive control policy that synchronously
actuates each link for 3 time steps but does not satisfy $\phi$
since $x_2$ remains above 30 vehicles. If estimates of turn ratios
and saturation limits are available from, e.g., historical data,
then we can obtain an MDP that approximates the true traffic
dynamics and determine the optimal control policy for the
corresponding Rabin-weighted product MDP. When applied
to the true traffic model, the controller greatly outperforms
the naive policy but still does not satisfy $\phi$, as shown in Fig.
3(b). However, by modifying this policy via reinforcement
learning on the true traffic dynamics, we obtain a controller
that empirically often satisfies $\phi$ as seen in Fig. 3(c). (Note
that we should not expect $\phi$ to be satisfied for all traces of
the MDP or all disturbance inputs as such a controller may
not exist.)

This  example  suggests  how  our  approach  can  be  utilized 
in  practice:  a  “reasonable”  controller  can  be  obtained  by 
using  a  Rabin-weighted  MDP  generated  from  approximated 
traffic  parameters.  This  policy  can  then  be  modified  online 
to  obtain  a  control  policy  that  better  accommodates  existing 
conditions.  Additionally,  using  a  suboptimal  controller  prior 
to  learning  is  rarely  of  serious  concern  for  traffic  control  as 
the  cost  is  only  increased  delay  and  congestion. 

V.  Conclusion 

We have proposed a method for synthesizing a control
policy for an MDP such that traces of the MDP satisfy a
control objective expressed as an LTL formula. We proved
that our synthesis method is guaranteed to return a controller
that satisfies the LTL formula with probability one if such a
controller exists. We provided two case studies: in the first,
we used the proposed method to synthesize a control policy
for a virtual agent in a gridded environment, and in the
second, we synthesized a traffic signal controller for a small
traffic network with two signalized intersections.

The  most  immediate  direction  for  future  research  is  to 
investigate  theoretical  guarantees  in  the  case  when  the  LTL 
specification  cannot  be  satisfied  with  probability  one.  For 
example,  it  is  desirable  to  prove  or  disprove  the  conjecture 
that  for  appropriate  weightings  in  the  reward  function,  our 
proposed  method  finds  the  control  policy  that  maximizes  the 
probability  of  satisfying  the  LTL  specification.  In  the  event 
that  the  conjecture  is  not  true,  we  wish  to  identify  fragments 
of  LTL  for  which  the  conjecture  holds.  Future  research  will 
also explore other application areas such as
human-in-the-loop semiautonomous driving.

Appendix 

A.  Proof  of  Theorem  1 

Proof: Suppose $\pi$ satisfies $\phi$ with probability 1. Then the
set of states of $M_{P,\pi}$, written as $MC_\pi$, can be represented
as a disjoint union of the transient states $T_\pi$ and the closed
irreducible recurrent classes $R^j_\pi$ [16]:

$$MC_\pi = T_\pi \sqcup R^1_\pi \sqcup \dots \sqcup R^m_\pi \tag{13}$$

Proposition 1. Policy $\pi$ satisfies $\phi$ with probability 1 if and
only if there exists $(G_i, B_i) \in F_P$ such that $B_i \subset T_\pi$ and
$R^j_\pi \cap G_i \neq \emptyset$ for all recurrent classes $R^j_\pi$.

We omit the proof of Proposition 1; it follows readily
from Definition 6.
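
The condition in Proposition 1 can be checked mechanically once the Markov chain induced by a policy is known. The sketch below is an illustration under stated assumptions, not part of the original method: the function names, the dense transition-matrix input, and the example chain are hypothetical. It relies on the standard fact that, in a finite chain, a communicating class is recurrent exactly when it is closed.

```python
def classify_states(P):
    """Split a finite Markov chain into transient states and recurrent classes.

    P[i][j] is the one-step probability of moving from state i to state j.
    """
    n = len(P)
    # reach[i][j] is True when j can be reached from i in zero or more steps.
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])

    transient, recurrent_classes, seen = set(), [], set()
    for i in range(n):
        if i in seen:
            continue
        # States that communicate with i (mutually reachable).
        cls = {j for j in range(n) if reach[i][j] and reach[j][i]}
        seen |= cls
        # The class is closed if nothing outside it can be reached from it.
        closed = all(j in cls for s in cls for j in range(n) if reach[s][j])
        if closed:
            recurrent_classes.append(cls)
        else:
            transient |= cls
    return transient, recurrent_classes


def satisfies_rabin_pair(P, G, B):
    """Check the condition of Proposition 1 for one Rabin pair (G, B)."""
    transient, recurrent_classes = classify_states(P)
    return set(B) <= transient and all(cls & set(G) for cls in recurrent_classes)


# Hypothetical chain: state 0 is transient, states 1 and 2 form one recurrent class.
chain = [[0.5, 0.5, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
print(satisfies_rabin_pair(chain, G={1}, B={0}))
```

The cubic-time reachability closure is chosen for brevity; a strongly-connected-components algorithm would scale better on large product MDPs.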

Let $\Pi^*$ be the finite set of optimal policies that optimize
the expected future utility. We constructively show that for
large enough values of the discount factor $\gamma$ and of the
magnitude of $w_B$, the negative reward on non-accepting states,
all policies $\pi^* \in \Pi^*$ satisfy $\phi$ with probability 1.

Fig. 3. Sample trajectories of the traffic network in Fig. 2. (a) A simple
controller that synchronously actuates links for 3 time periods and does
not satisfy $\phi$. (b) An optimal controller for an MDP obtained from an
approximate model of the traffic dynamics (e.g., a model with turn ratios
and saturation limits different than reality). This controller outperforms the
previous naive controller, but does not fully satisfy $\phi$. (c) The controller
from (b) is modified via reinforcement learning on the true traffic model. In
the lower plot for all cases, signal $i$ for $i = 1, 2$ is green if link $i$ is actuated
and is red otherwise. This example suggests how a reasonable control policy
can be obtained from an approximate MDP estimated via, e.g., historical
data and modified “online” using reinforcement learning on observed traffic
dynamics.


Suppose $\pi^* \in \Pi^*$ does not satisfy $\phi$. Then one of the
following two cases must be true:

•  Case 1: There exists a recurrent class $R^j_{\pi^*}$ such that
$R^j_{\pi^*} \cap G_i = \emptyset$. This means that with policy $\pi^*$ it is
possible to visit $G_i$ only finitely often.

•  Case 2: There exists $b \in B_i$ such that $b$ is recurrent,
that is, $b \in R^j_{\pi^*}$ for some recurrent class of $M_{P,\pi^*}$.
This translates to the possibility of visiting a state in $B_i$
infinitely often.

We let $\Pi^* = \Pi_1 \cup \Pi_2$, where $\Pi_1$ ($\Pi_2$) is the set of optimal
policies that do not satisfy $\phi$ because Case 1 (Case 2) holds.
Notice that this is not a disjoint union.

In addition, we know that the vector of utilities for any
policy $\pi^* \in \Pi^*$ is $U_{\pi^*} \in \mathbb{R}^N$, where $N = |MC_{\pi^*}|$ is the
number of states of $M_{P,\pi^*}$:

$$U_{\pi^*} = \sum_{n=0}^{\infty} \gamma^n P^n_{\pi^*} W \tag{14}$$

In this equation, $U_{\pi^*} = [U_{\pi^*}(s_0) \dots U_{\pi^*}(s_N)]^T$,
$W = [W(s_0) \dots W(s_N)]^T$, and $P_{\pi^*}$ is the transition probability
matrix with entries $p_{\pi^*}(s_i, s_j)$, the probability of
transitioning from $s_i$ to $s_j$ using policy $\pi^*$.
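
As a sanity check on equation (14), the series can be evaluated numerically and compared with the fixed-point relation $U = W + \gamma P U$ that it satisfies. This is a minimal sketch on a hypothetical two-state chain; the helper names, the chain, and the rewards $-1$ and $+1$ (standing in for $w_B$ and $w_G$) are assumptions for illustration only.

```python
def mat_vec(P, v):
    """Matrix-vector product for a row-major list-of-lists matrix."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def discounted_utility(P, W, gamma, terms=2000):
    """Truncated evaluation of U = sum_{n>=0} gamma^n P^n W."""
    U = [0.0] * len(W)
    term = list(W)                 # holds gamma^n P^n W, starting at n = 0
    for _ in range(terms):
        U = [u + t for u, t in zip(U, term)]
        term = [gamma * x for x in mat_vec(P, term)]
    return U


# Hypothetical chain: state 0 is transient with reward -1,
# state 1 is absorbing with reward +1.
P = [[0.5, 0.5],
     [0.0, 1.0]]
W = [-1.0, 1.0]
gamma = 0.9

U = discounted_utility(P, W, gamma)
residual = [u - (w + gamma * pu) for u, w, pu in zip(U, W, mat_vec(P, U))]
print(U, residual)
```

For the absorbing state the series reduces to the geometric sum $1/(1-\gamma)$, which is why its utility grows without bound as $\gamma$ approaches 1.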

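We partition the vectors in equation (14) into their transient
and recurrent components:

$$\begin{bmatrix} U^{tr}_{\pi^*} \\ U^{rec}_{\pi^*} \end{bmatrix} = \sum_{n=0}^{\infty} \gamma^n \begin{bmatrix} P_{\pi^*}(T,T) & [P^{tr_1}_{\pi^*} \dots P^{tr_m}_{\pi^*}] \\ 0_{(\sum_{i=1}^{m} N_i) \times q} & P_{\pi^*}(R,R) \end{bmatrix}^n \begin{bmatrix} W^{tr} \\ W^{rec} \end{bmatrix} \tag{15}$$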

In equation (15), $U^{tr}_{\pi^*}$ is a vector representing the utility
of every transient state. Assuming we have $q$ transient states,
$P_{\pi^*}(T,T)$ is a $q \times q$ probability transition matrix containing
the probability of transitioning from one transient state to
another. Assuming there are $m$ different recurrent classes,
$0_{(\sum_{i=1}^{m} N_i) \times q}$ is a zero matrix representing the probability
of transitioning from any of the $m$ recurrent classes, each
with size $N_i$, to any of the transient states. This probability
is equal to 0 for all of these entries.

On the other hand, $P^{tr}_{\pi^*} = [P^{tr_1}_{\pi^*} \dots P^{tr_m}_{\pi^*}]$ is a
$q \times \sum_{i=1}^{m} N_i$ matrix, where each $P^{tr_k}_{\pi^*}$ is a $q \times N_k$ matrix whose
elements denote the probability of transitioning from any
transient state $t_j$, $j \in \{1, \dots, q\}$, to every state of the $k$th
recurrent class $R^k_{\pi^*}$.

Finally, $P_{\pi^*}(R,R)$ is a $\sum_{i=1}^{m} N_i \times \sum_{i=1}^{m} N_i$ block diagonal
matrix with $m$ blocks, one for every recurrent class, that
contains the probabilities of transitioning from one
recurrent state to another. It is clear that $P_{\pi^*}(R,R)$ is a
stochastic matrix since each block of size $N_i \times N_i$ is a stochastic
matrix [16]. From equation (15), we can conclude:

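$$U^{rec}_{\pi^*} = \sum_{n=0}^{\infty} \gamma^n \begin{bmatrix} 0 & P^n_{\pi^*}(R,R) \end{bmatrix} \begin{bmatrix} W^{tr} \\ W^{rec} \end{bmatrix} \tag{16}$$

$$= \sum_{n=0}^{\infty} \gamma^n P^n_{\pi^*}(R,R)\, W^{rec} \tag{17}$$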

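Also, with some approximations, a lower bound on $U^{tr}_{\pi^*}$
can be found:

$$\sum_{n=0}^{\infty} \gamma^n \begin{bmatrix} P^n_{\pi^*}(T,T) & P^{tr}_{\pi^*} P^n_{\pi^*}(R,R) \end{bmatrix} \begin{bmatrix} W^{tr} \\ W^{rec} \end{bmatrix} < U^{tr}_{\pi^*} \tag{18}$$

$$\sum_{n=0}^{\infty} \gamma^n P^n_{\pi^*}(T,T)\, W^{tr} + \sum_{n=0}^{\infty} \gamma^n P^{tr}_{\pi^*} P^n_{\pi^*}(R,R)\, W^{rec} < U^{tr}_{\pi^*} \tag{19}$$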

Case 1:

We first consider all policies $\pi^* \in \Pi_1$. These are the policies
for which Case 1 holds, so for $\pi^*$ there exists some $j$ such that
$R^j_{\pi^*} \cap G_i = \emptyset$. We choose any state $s \in R^j_{\pi^*}$ and use
equation (16) to show that $s$ has a non-positive utility under
$\pi^*$, that is, $U_{\pi^*}(s) \le 0$:

$$U_{\pi^*}(s) = U^{rec}_{\pi^*}(s) = \sum_{n=0}^{\infty} \gamma^n \begin{bmatrix} 0_{k_1 \times q} & P^n_j & 0_{k_2 \times q} \end{bmatrix} W^{rec} \tag{20}$$

$$= \sum_{n=0}^{\infty} \gamma^n P^n_j W_j \le 0 \implies U_{\pi^*}(s) \le 0 \tag{21}$$

In equation (20), $k_1 = \sum_{i=1}^{j-1} N_i$ and $k_2 = \sum_{i=j+1}^{m} N_i$. $P^n_j$
is the vector that corresponds to transition probabilities from
$s \in R^j_{\pi^*}$ to any other state in the same recurrent class
using policy $\pi^*$, and $W_j$ is the vector of reward values of the
states in the recurrent class $R^j_{\pi^*}$. Since none
of these states are in $G_i$, we conclude that $w \le 0$ for all
elements $w \in W_j$.

We first consider the case that $s$ is in a recurrent class of
$MC_\pi$.

•  If $s$ is in some recurrent class, $s \in R^j_\pi$, then by Proposition 1,
$R^j_\pi \cap G_i \neq \emptyset$. Therefore, there is at least one $s_g \in G_i$
such that $s_g \in R^j_\pi$ and $s \in R^j_\pi$. In addition, we know
that all states in $B_i$ are in the transient class. Therefore,
the vector of rewards in this recurrent class, $W_j$ as
defined previously, contains only non-negative elements. That
is, $0 \le w$ for all elements $w \in W_j$, and there exists at
least one $w_g \in W_j$ with $0 < w_g$.

$$0 < \sum_{n=0}^{\infty} \gamma^n P^n_j W_j \implies 0 < U_\pi(s) \tag{22}$$

We have shown that for some $s$ and any policy $\pi^* \in \Pi_1$,
$U_{\pi^*}(s) < U_\pi(s)$, which contradicts the optimality
assumption of $\pi^*$ for the case where $s \in R^j_\pi$. Thus, we
must have that $s$ is in a transient class of $MC_\pi$.

•  If $s$ is in a transient class, $s \in T_\pi$, we first find a
lower bound on $U_\pi(s)$ and show that this lower bound can
be made greater than any positive number for a large enough
choice of $\gamma$. Note that in the worst case all the states in the
transient set of $\pi$ have reward $w_B < 0$, that is,
$W^{tr} = W_B = [w_B \dots w_B]^T$, and there is only
one state $s_G \in G_i$ that lives in the recurrent class. That
is, $w_G \in W^{rec}$ is a positive reward.

Proposition 2. For transient states $t_1, t_2 \in T$, there
exists $N < \infty$ such that:

$$\sum_{n=0}^{\infty} p^n(t_1, t_2) < N, \tag{23}$$

that is, the infinite sum is bounded [16].
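
Proposition 2 can be illustrated numerically: the partial sums of the powers of the transient block converge to a finite matrix whose entries are the expected numbers of visits. The sketch below uses a hypothetical 2 x 2 transient block; the helper names and the numbers are assumptions chosen so that the limit is easy to verify by hand.

```python
def mat_mul(A, B):
    """Product of two row-major list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def expected_visits(P_TT, terms=200):
    """Partial sum of sum_{n>=0} P_TT^n over the transient block P_TT.

    Entry (i, j) approximates the expected number of visits to transient
    state j when starting from transient state i.
    """
    q = len(P_TT)
    power = [[float(i == j) for j in range(q)] for i in range(q)]  # P_TT^0
    total = [[0.0] * q for _ in range(q)]
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(q)] for i in range(q)]
        power = mat_mul(power, P_TT)
    return total


# Hypothetical transient block: each row sums to less than 1 because the
# remaining probability mass leaks into the recurrent classes.
P_TT = [[0.5, 0.25],
        [0.0, 0.5]]
visits = expected_visits(P_TT)
print(visits)   # close to [[2, 1], [0, 2]], the inverse of (I - P_TT)
```

Any constant $N$ larger than the biggest entry of this limit matrix satisfies (23) for this example.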

We assume $q := |T_\pi|$ is the number of transient states.
In addition, $P^n_\pi(R,R)$ is a stochastic matrix with row
sum of 1 [16].


$$\sum_{n=0}^{\infty} \gamma^n P^n_\pi(T,T)\, W^{tr} + \sum_{n=0}^{\infty} \gamma^n P^{tr}_\pi P^n_\pi(R,R)\, W^{rec} < U^{tr}_\pi \tag{24}$$


$$U^{rec}_\pi = \sum_{n=0}^{\infty} \gamma^n P^n_\pi(R,R)\, W^{rec}$$


$$N_1 I_{q \times q} W_B + \sum_{n=0}^{\infty} \gamma^n P^{tr}_\pi P^n_\pi(R,R)\, W^{rec} < U^{tr}_\pi \tag{25}$$

Proposition 3. If $p^n(s,s)$ is the probability of returning
from a state $s$ to itself in $n$ time steps, there exists a
lower bound on $\sum_{n=0}^{\infty} \gamma^n p^n(s,s)$.

First, there exists $\bar{n}$ such that $p^{\bar{n}}(s,s)$ is nonzero and
bounded. That is, $s$ returns to itself after $\bar{n}$ time steps with a
nonzero probability.

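Also, we know that $(p^{\bar{n}}(s,s))^n \le p^{n\bar{n}}(s,s)$. Therefore:

$$\sum_{n=0}^{\infty} \gamma^n p^n(s,s) \ge \sum_{n=0}^{\infty} \gamma^{n\bar{n}} p^{n\bar{n}}(s,s) \tag{26}$$

$$\ge \sum_{n=0}^{\infty} \left(\gamma^{\bar{n}} p^{\bar{n}}(s,s)\right)^n \tag{27}$$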


$$\ge \frac{\bar{p}}{1 - \gamma^{\bar{n}}} \tag{28}$$
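
The inequality relating the $\bar{n}$-step return probability to its multiples, used in the derivation above, is a standard consequence of the Chapman-Kolmogorov equations. A short derivation is sketched here for completeness; it is not taken from the original proof, and the index $k$ is introduced only for this illustration.

```latex
\begin{align*}
p^{(k+1)\bar{n}}(s,s) &= \sum_{s'} p^{k\bar{n}}(s,s')\, p^{\bar{n}}(s',s) \\
  &\ge p^{k\bar{n}}(s,s)\, p^{\bar{n}}(s,s)
  && \text{(keep only the term } s' = s\text{)} \\
\Rightarrow\quad p^{k\bar{n}}(s,s) &\ge \bigl(p^{\bar{n}}(s,s)\bigr)^{k}
  && \text{(induction on } k\text{)}
\end{align*}
```

Inequality (26) then holds because the terms that are dropped, those whose index is not a multiple of $\bar{n}$, are all non-negative.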

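Going back to equation (24), we find a stricter lower
bound on the utility of every state $U_\pi(s)$ using
Proposition 3:

$$N_1 w_B + \frac{\bar{m}}{1 - \gamma^{\bar{n}}} < U_\pi(s) = U^{tr}_\pi(s) \tag{29}$$

$$\text{If } 0 < N_1 w_B + \frac{\bar{m}}{1 - \gamma^{\bar{n}}} \tag{30}$$

$$\implies U_{\pi^*}(s) < U_\pi(s) \tag{31}$$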

Here $\bar{m} = \max(M)$ and $M \le P^{tr}_\pi P' W^{rec}$, where $P'$ is
a block matrix whose nonzero elements are the bounds $\bar{p}$
derived from Proposition 3.

For a fixed $w_B$, we can select a large enough $\gamma$ so that
equation (30) holds for all $\pi^* \in \Pi_1$. This condition
implies equation (31), which contradicts the optimality
of any $\pi^* \in \Pi_1$. Therefore, $\pi^*$ cannot be optimal unless
it visits $G_i$ infinitely often.

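Case 2:

Now we consider Case 2, where $\pi^* \in \Pi_2$. Here, for some
$b \in B_i$, $b \in R^j_{\pi^*}$. In addition, this state is in the transient
class of $\pi$, $b \in T_\pi$. Using the same procedure as in the previous
case, we find the following lower bound on $U^{tr}_\pi$:

$$U^{tr}_\pi \ge \sum_{n=0}^{\infty} \gamma^n P^n_\pi(T,T)\, W^{tr} \tag{32}$$

$$\ge \sum_{n=0}^{\infty} P^n_\pi(T,T)\, W^{tr} \tag{33}$$

$$\text{(Proposition 2)} \quad \ge N_2 I_{q \times q} W_B \tag{34}$$

$$\ge N_2 w_B \tag{35}$$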

We know that $b$ is in a recurrent class under policy
$\pi^*$, so we can use equation (16) to find a bound on its
utility. An upper bound assumes that all the other states in
the recurrent class have the positive reward $w_G$:

$$U^{rec}_{\pi^*}(b) \le \sum_{n=0}^{\infty} \gamma^n w_G + \sum_{n=0}^{\infty} \gamma^n p^n_{\pi^*}(b,b)\, w_B \tag{37}$$

$$\le w_G \frac{1}{1-\gamma} + w_B \sum_{n=0}^{\infty} \gamma^n p^n_{\pi^*}(b,b) \tag{38}$$

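If the condition in equation (39) holds, we
conclude that for a state $b$, $U_{\pi^*}(b) < U_\pi(b)$, which violates
the optimality of $\pi^*$.

$$U_{\pi^*}(b) \le w_G \frac{1}{1-\gamma} + w_B \sum_{n=0}^{\infty} \gamma^n p^n_{\pi^*}(b,b) < U_\pi(b) \tag{39}$$

We only need to enforce:

$$w_G \frac{1}{1-\gamma} + w_B \sum_{n=0}^{\infty} \gamma^n p^n_{\pi^*}(b,b) < N_2 w_B \tag{40}$$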

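Since there are only a finite number of policies in $\Pi_2$,
over all policies $\pi^* \in \Pi_2$ we can find $\bar{p}$ such that:

$$\sum_{n=0}^{\infty} \gamma^n p^n_{\pi^*}(b,b) < \sum_{n=0}^{\infty} \gamma^n \bar{p} \tag{41}$$

Therefore equation (40) can be simplified:

$$w_G \frac{1}{1-\gamma} + w_B \sum_{n=0}^{\infty} \gamma^n \bar{p} < N_2 w_B \tag{42}$$

$$w_G \frac{1}{1-\gamma} + w_B \frac{\bar{p}}{1-\gamma} < N_2 w_B \tag{43}$$

$$(w_G + w_B \bar{p}) \left(\frac{1}{1-\gamma}\right) < N_2 w_B \tag{44}$$

$$(w_G + w_B \bar{p}) - N_2 w_B (1-\gamma) < 0 \tag{45}$$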

We assume without loss of generality that $w_G = 1$. For a
fixed value of $\gamma$, we choose $w_B$ small enough so that all
$\pi^* \in \Pi_2$ satisfy equation (45) and violate the optimality
condition.

As a result, any optimal policy must avoid Case 2, that is,
it visits states in $B_i$ only finitely often.
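
To make the choice of $w_B$ concrete, condition (45) can be solved for $w_B$. The rearrangement below is a sketch that assumes $w_G = 1$ and $\bar{p} - N_2(1-\gamma) > 0$; the second assumption is not stated in the original and holds when $\bar{p} > 0$ and $\gamma$ is close enough to 1.

```latex
\begin{align*}
(w_G + w_B \bar{p}) - N_2 w_B (1-\gamma) < 0
  &\iff w_G + w_B \bigl(\bar{p} - N_2 (1-\gamma)\bigr) < 0 \\
  &\iff w_B < -\frac{w_G}{\bar{p} - N_2 (1-\gamma)}
        = -\frac{1}{\bar{p} - N_2 (1-\gamma)}
\end{align*}
```

Under these assumptions any sufficiently negative $w_B$ satisfies (45), which matches the statement that $w_B$ is chosen small enough.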

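For optimal policies $\pi^* \in \Pi_1 \cap \Pi_2$, we need to find $\gamma$
and $w_B$ such that both conditions for Case 1 and Case 2 are
satisfied. That is:

$$\begin{cases} 0 < N_1 w_B (1 - \gamma^{\bar{n}}) + \bar{m} \\ (1 + w_B \bar{p}) - N_2 w_B (1 - \gamma) < 0 \end{cases} \tag{46}$$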

We select a pair of $\gamma$ and $w_B$ so that the system of inequalities
in (46) is satisfied. This solution can be found as follows:
First, for a small real number $0 < \epsilon < \bar{m}$, we select $w_B^*$
such that:

$$1 + w_B^* \bar{p} < -\epsilon \tag{47}$$

Then, $\gamma^*$ is selected so that the following holds:

$$\max\{-N_1 w_B^* (1 - (\gamma^*)^{\bar{n}}),\; -N_2 w_B^* (1 - \gamma^*)\} < \epsilon \tag{48}$$

The pair $(w_B^*, \gamma^*)$ satisfies (46), and as a result
none of the policies $\pi^* \in \Pi^*$ are optimal. □
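
The two-step selection in (47) and (48) can be written as a small procedure. The sketch below assumes that the constants $N_1$, $N_2$, $\bar{p}$, $\bar{n}$ and $\bar{m}$ from the proof are known, which is generally not the case in practice; the function names, the halving schedule for $1 - \gamma$, and the example numbers are hypothetical and serve only to illustrate that a valid pair exists.

```python
def choose_parameters(N1, N2, p_bar, n_bar, m_bar, eps=None):
    """Pick (w_B, gamma) following the two steps (47) and (48).

    All inputs are the proof's constants and are assumed to be given.
    """
    if eps is None:
        eps = 0.5 * m_bar                 # any value with 0 < eps < m_bar works
    if not (0 < eps < m_bar and p_bar > 0):
        raise ValueError("need 0 < eps < m_bar and p_bar > 0")
    # (47): 1 + w_B * p_bar < -eps  <=>  w_B < -(1 + eps) / p_bar.
    # Subtracting 1 keeps the inequality strict.
    w_B = -(1.0 + eps) / p_bar - 1.0
    # (48): push gamma toward 1 until both terms fall below eps.
    gamma = 0.5
    while max(-N1 * w_B * (1 - gamma ** n_bar),
              -N2 * w_B * (1 - gamma)) >= eps:
        gamma = 1 - (1 - gamma) / 2
    return w_B, gamma


def satisfies_system_46(w_B, gamma, N1, N2, p_bar, n_bar, m_bar):
    """Evaluate both inequalities of system (46) with w_G = 1."""
    first = 0 < N1 * w_B * (1 - gamma ** n_bar) + m_bar
    second = (1 + w_B * p_bar) - N2 * w_B * (1 - gamma) < 0
    return first and second


# Hypothetical constants, chosen only to exercise the procedure.
consts = dict(N1=5, N2=5, p_bar=0.2, n_bar=3, m_bar=0.4)
w_B, gamma = choose_parameters(**consts)
print(w_B, gamma, satisfies_system_46(w_B, gamma, **consts))
```

The loop terminates because both terms in (48) tend to zero as $\gamma$ approaches 1 for any fixed $w_B^*$.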